Bengaluru News Headlines Analysis: 2001 to 2020

1. Initial setup

1.1. Importing required packages

In [1]:
import time

import numpy as np
import pandas as pd

import spacy

from sklearn.feature_extraction import text

from wordcloud import WordCloud
import matplotlib.pyplot as plt

import plotly.express as px
In [2]:
# Spacy essentials
nlp_en = spacy.load("en_core_web_sm")
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

1.2. Utility functions

In [3]:
def compute_time_difference(time_start, time_end):
    """
    Compute the time difference between two timestamps.
    @param time_start: Start time (seconds since the epoch)
    @param time_end: End time (seconds since the epoch)
    @return: Human-readable time difference string
    """
    
    # Computing time difference
    time_diff = time_end - time_start
    
    # Initializing time string with the value in seconds
    time_str = str(round(time_diff, 4)) + " seconds"
    
    # Switching to minutes if the value exceeds a minute
    if time_diff > 60:
        time_str = str(round(time_diff/60, 4)) + " minutes"
        
    # Returning time difference string
    return time_str
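As a quick sanity check of the unit switch at the 60-second mark (the helper is re-declared here so the snippet runs on its own):

```python
def compute_time_difference(time_start, time_end):
    """Return the elapsed time between two timestamps as a readable string."""
    time_diff = time_end - time_start
    if time_diff > 60:
        return str(round(time_diff / 60, 4)) + " minutes"
    return str(round(time_diff, 4)) + " seconds"

print(compute_time_difference(0.0, 30.0))  # 30.0 seconds
print(compute_time_difference(0.0, 90.0))  # 1.5 minutes
```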
In [4]:
# Total Notebook Run: Start Time
total_time_start = time.time()

2. Data Collection, Analysis and Upload

2.1. Obtaining dataframes from csv files

In [5]:
news_df = pd.read_csv("data/india-news-headlines.csv")

2.2. Dataframes sanity

In [6]:
print("Data dimensions:")
news_df.shape
Data dimensions:
Out[6]:
(3297172, 3)
In [7]:
print("Data types:")
news_df.dtypes
Data types:
Out[7]:
publish_date          int64
headline_category    object
headline_text        object
dtype: object
In [8]:
print("First few entries:")
news_df.head()
First few entries:
Out[8]:
publish_date headline_category headline_text
0 20010101 sports.wwe win over cena satisfying but defeating underta...
1 20010102 unknown Status quo will not be disturbed at Ayodhya; s...
2 20010102 unknown Fissures in Hurriyat over Pak visit
3 20010102 unknown America's unwanted heading for India?
4 20010102 unknown For bigwigs; it is destination Goa
In [9]:
print("Last few entries:")
news_df.tail()
Last few entries:
Out[9]:
publish_date headline_category headline_text
3297167 20200630 gadgets-news why tiktok removed 1 65 crore videos in india
3297168 20200630 entertainment.hindi.bollywood apurva asrani calls alia bhatts mother soni ra...
3297169 20200630 entertainment.hindi.bollywood kangana ranaut gets a doll version of herself ...
3297170 20200630 entertainment.hindi.bollywood meezaan jaffrey reminisces his childhood days ...
3297171 20200630 entertainment.telugu.movies.news prabhas20 titled as radhe shyam prabhas and po...
In [10]:
print("Columnar statistics:")
news_df.describe(include="all")
Columnar statistics:
Out[10]:
publish_date headline_category headline_text
count 3.297172e+06 3297172 3297172
unique NaN 1016 3082589
top NaN india Sunny Leone HOT photos
freq NaN 285619 98
mean 2.012470e+07 NaN NaN
std 4.896213e+04 NaN NaN
min 2.001010e+07 NaN NaN
25% 2.009101e+07 NaN NaN
50% 2.013071e+07 NaN NaN
75% 2.016110e+07 NaN NaN
max 2.020063e+07 NaN NaN
In [11]:
print("Checking for null values:")
news_df.isnull().sum()
Checking for null values:
Out[11]:
publish_date         0
headline_category    0
headline_text        0
dtype: int64

2.3. Text Analysis

2.3.1. Understanding text categories and picking a relevant subset of data

In [12]:
# value_counts already returns counts in descending order
news_cat_df = (news_df["headline_category"]
               .value_counts()
               .rename_axis("headline_category")
               .reset_index(name="count"))
In [13]:
print("Data dimensions:")
news_cat_df.shape
Data dimensions:
Out[13]:
(1016, 2)
In [14]:
print("First 50 entries:")
news_cat_top_df = news_cat_df.head(50)
First 50 entries:
In [15]:
fig = px.bar(news_cat_top_df, x="headline_category", y="count")
fig.update_xaxes(tickangle=45)
fig.update_layout(title="Headline category histogram: Top 50")
fig.show()

NOTE: We will filter the data down to the city.bengaluru category.

2.3.2. Analysis of filtered data

In [16]:
bng_news_df = news_df.loc[news_df["headline_category"]=="city.bengaluru"]
In [17]:
print("Data dimensions:")
bng_news_df.shape
Data dimensions:
Out[17]:
(91857, 3)
In [18]:
print("First few entries:")
bng_news_df.head()
First few entries:
Out[18]:
publish_date headline_category headline_text
274 20010104 city.bengaluru Three in race for chief secy's post
278 20010104 city.bengaluru He's not so inscrutable
4428 20010518 city.bengaluru Don't take that biscuit; you dope
4429 20010518 city.bengaluru 'I've done my bit when it comes to preparing'
4436 20010518 city.bengaluru He is etched in the chapters of Bangalore

Word Cloud Visualization

In [19]:
# Getting all headline text into one string
bng_all_text = ", ".join([x.lower() for x in bng_news_df["headline_text"]])
In [20]:
# Generating WordCloud 
wordcloud = WordCloud(width = 800, height = 800,
                      background_color ="white",
                      stopwords = spacy_stopwords,
                      min_font_size = 10).generate(bng_all_text) 

# Plotting the WordCloud image
plt.figure(figsize = (10, 10), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show() 

Bigrams and Trigrams

In [21]:
def get_top_ngrams(bow, max_features, ngrams, top_k):
    '''
    Get the top n-grams for the given corpus
    
    @param bow: List of documents (bag of words)
    @param max_features: Maximum number of features to keep
    @param ngrams: Order of the n-grams 
        (1=> unigram, 2=>bigram, 3=>trigram)
    @param top_k: Number of top n-grams to return
    
    @return: Top n-grams for the given corpus
    '''
    vectorizer = text.CountVectorizer(ngram_range=(ngrams, ngrams),
                                      max_features=max_features,
                                      stop_words="english")
    matrix = vectorizer.fit_transform(bow)
    features = vectorizer.get_feature_names_out()
    ngrams_result = pd.Series(np.array(matrix.sum(axis=0))[0],
                              index=features)
    ngrams_top_k = ngrams_result.sort_values(ascending=False).head(top_k)
    return ngrams_top_k
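Under the hood, bigram counting amounts to sliding a two-token window over each headline and tallying the results. A minimal stdlib sketch of the same idea (whitespace tokenization and no stop-word removal, so its counts will differ from the vectorizer's):

```python
from collections import Counter

def top_ngrams(texts, n=2, top_k=3):
    """Count n-grams across a list of texts and return the top_k most common."""
    counts = Counter()
    for line in texts:
        tokens = line.lower().split()
        # Slide a window of n tokens and join each window into one n-gram string
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_k)

headlines = ["karnataka high court ruling",
             "karnataka high court order",
             "city traffic update"]
print(top_ngrams(headlines, n=2, top_k=2))
# [('karnataka high', 2), ('high court', 2)]
```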
In [22]:
# Bag of Words from bng_news_df
bng_news_text = bng_news_df["headline_text"].tolist()

Top Bigrams

In [23]:
top_k_bng_bigrams = get_top_ngrams(bow=bng_news_text, max_features=5000, ngrams=2, top_k=20)
In [24]:
top_k_bng_bigrams_df = pd.DataFrame(top_k_bng_bigrams, columns=["count"]).reset_index().rename(columns={"index": "bigram"})
In [25]:
print("Data dimensions:")
top_k_bng_bigrams_df.shape
Data dimensions:
Out[25]:
(20, 2)
In [26]:
print("First few entries:")
top_k_bng_bigrams_df.head()
First few entries:
Out[26]:
bigram count
0 year old 638
1 high court 380
2 rs lakh 261
3 ends life 233
4 karnataka high 230
In [27]:
fig = px.bar(top_k_bng_bigrams_df, x="bigram", y="count")
fig.update_xaxes(tickangle=45)
fig.update_layout(title="Top 20 Bigrams")
fig.show()

Top Trigrams

In [28]:
top_k_bng_trigrams = get_top_ngrams(bow=bng_news_text, max_features=5000, ngrams=3, top_k=20)
In [29]:
top_k_bng_trigrams_df = pd.DataFrame(top_k_bng_trigrams, columns=["count"]).reset_index().rename(columns={"index": "trigram"})
In [30]:
print("Data dimensions:")
top_k_bng_trigrams_df.shape
Data dimensions:
Out[30]:
(20, 2)
In [31]:
print("First few entries:")
top_k_bng_trigrams_df.head()
First few entries:
Out[31]:
trigram count
0 karnataka high court 224
1 karnataka election 2018 134
2 kempegowda international airport 93
3 year old girl 88
4 year old boy 63
In [32]:
fig = px.bar(top_k_bng_trigrams_df, x="trigram", y="count")
fig.update_xaxes(tickangle=45)
fig.update_layout(title="Top 20 Trigrams")
fig.show()

Nouns and Verbs

In [33]:
time_start = time.time()
In [34]:
noun_bng = []
verb_bng = []
# n_threads is deprecated in spaCy v2.1+; use n_process for multiprocessing
# (adjust to the number of available cores)
for doc in nlp_en.pipe(bng_news_text, n_process=4, batch_size=1000):
    for token in doc:
        if token.pos_ == "NOUN":
            noun_bng.append(token.text.lower())
        elif token.pos_ == "VERB":
            verb_bng.append(token.text.lower())
In [35]:
time_end = time.time()
print("Total time taken to obtain all nouns and verbs: "+compute_time_difference(time_start, time_end))
Total time taken to obtain all nouns and verbs: 2.84 minutes

Top Nouns

In [36]:
nouns_df = pd.DataFrame(noun_bng, columns=["noun"])
In [37]:
# value_counts already returns counts in descending order
nouns_cnt_df = (nouns_df["noun"]
                .value_counts()
                .rename_axis("noun")
                .reset_index(name="count"))
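The value_counts step is, conceptually, just a frequency tally over the token list. A stdlib sketch over hypothetical tokens (not the actual extracted nouns):

```python
from collections import Counter

# Hypothetical noun tokens extracted from headlines
tokens = ["students", "rs", "students", "cops", "students", "rs"]

# most_common returns (token, count) pairs sorted by count, descending
noun_counts = Counter(tokens)
print(noun_counts.most_common(2))  # [('students', 3), ('rs', 2)]
```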
In [38]:
print("Data dimensions:")
nouns_cnt_df.shape
Data dimensions:
Out[38]:
(15000, 2)
In [39]:
print("First few entries:")
nouns_cnt_df.head()
First few entries:
Out[39]:
noun count
0 students 1715
1 rs 1614
2 cops 1499
3 city 1458
4 bengaluru 1415
In [40]:
top20_nouns_cnt_df = nouns_cnt_df.head(20)
In [41]:
fig = px.bar(top20_nouns_cnt_df, x="noun", y="count")
fig.update_xaxes(tickangle=45)
fig.update_layout(title="Top 20 Nouns")
fig.show()

Top Verbs

In [42]:
verbs_df = pd.DataFrame(verb_bng, columns=["verb"])
In [43]:
# value_counts already returns counts in descending order
verbs_cnt_df = (verbs_df["verb"]
                .value_counts()
                .rename_axis("verb")
                .reset_index(name="count"))
In [44]:
print("Data dimensions:")
verbs_cnt_df.shape
Data dimensions:
Out[44]:
(8535, 2)
In [45]:
print("First few entries:")
verbs_cnt_df.head()
First few entries:
Out[45]:
verb count
0 will 2093
1 says 1479
2 held 1357
3 may 1292
4 gets 1208
In [46]:
top20_verbs_cnt_df = verbs_cnt_df.head(20)
In [47]:
fig = px.bar(top20_verbs_cnt_df, x="verb", y="count")
fig.update_xaxes(tickangle=45)
fig.update_layout(title="Top 20 verbs")
fig.show()
In [48]:
# Total Notebook Run: End Time
total_time_end = time.time()
print("Total time taken to run the entire notebook: "+compute_time_difference(total_time_start, total_time_end))
Total time taken to run the entire notebook: 3.6424 minutes
In [ ]: